Community-driven data grids

Authors

  • Tobias Scholl
  • Alfons Kemper
Abstract

E-science communities, and especially the astronomy community, have put tremendous effort into providing global access to their distributed scientific data sets to foster vivid data and knowledge sharing within their scientific federations. Beyond the already huge existing data volumes, collaborating researchers face major challenges in managing the anticipated data deluge of forthcoming projects with expected data rates of several terabytes a day, such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), the Large Synoptic Survey Telescope (LSST), or the Low Frequency Array (LOFAR). In this thesis, we describe and investigate community-driven data grids as an e-science data management solution. Community-driven data grids target domain-specific federations and provide scalable, distributed, and collaborative data management. Our infrastructure optimizes the overall query throughput by employing dominant data characteristics (e.g., data skew) and query patterns. By combining well-established techniques for data partitioning and replication with Peer-to-Peer (P2P) technologies, we can address several challenging problems: data load balancing, efficient data dissemination and query processing, handling of query hot spots, and the adaptation to short-term query bursts as well as long-term load redistributions. We propose a framework for investigating application-specific index structures to create locality-aware partitioning schemes (so-called histograms) and to find appropriate data mapping strategies. We particularly investigate how far mapping strategies based on space-filling curves preserve query locality and achieve data load balancing depending on query patterns in comparison to a random mapping. An efficient data dissemination technique for the anticipated large data volumes is important for several use cases within scientific federations, including initial data distribution and data replication.
A scalable solution should neither induce a high load on the transmitting servers nor create a high messaging overhead. Optimizing data distribution with regard to latency and bandwidth is infeasible in our scenario. Therefore, we propose several strategies that optimize network traffic, use chunk-based feeding, and improve data processing at receiving nodes in order to speed up data feeding. In the face of different typical submission scenarios, we show how community-driven data grids can adapt their query coordination strategies during query processing. We explore the impact of uniform or skewed submission patterns and compare multiple strategies with regard to their usability and scalability for data-intensive applications. Our techniques improve query throughput considerably through increased parallelism and data load balancing in both local and wide area deployments. Addressing skewed query workloads, so-called query hot spots, through query load balancing while directly meeting the requirements of a data-intensive e-science environment is another interesting and challenging task. We enhance our data-driven partitioning schemes to trade off data load balancing against handling query hot spots via splitting and replication. We use a cost-based approach for workload-aware data partitioning. Based on these workload-aware partitioning schemes, we use master-slave replication to compensate for short-term peaks in query load and address long-term shifts in data and query distributions by partitioning scheme evolution. Our research prototype HiSbase realizes the concepts described within this thesis and offers a basis for further research shaping the data management of future scientific communities.
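
The comparison between a space-filling-curve mapping and a random mapping can be illustrated with a minimal sketch. It is not taken from HiSbase: it assumes that partitions are cells of a regular 2D grid (the histograms in the thesis may well be irregular), it uses the Z-order (Morton) curve as one possible space-filling curve, and all function names are hypothetical. Query locality is approximated by the number of distinct nodes a region query has to contact.

```python
import random


def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Z-order (Morton) index of cell (x, y) on a 2^bits x 2^bits grid."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z


def curve_mapping(cells, num_nodes, bits=8):
    """Assign cells to nodes in contiguous runs along the Z-order curve."""
    ordered = sorted(cells, key=lambda c: interleave_bits(c[0], c[1], bits))
    per_node = -(-len(ordered) // num_nodes)  # ceiling division
    return {cell: i // per_node for i, cell in enumerate(ordered)}


def random_mapping(cells, num_nodes, seed=0):
    """Assign every cell to a uniformly chosen node (illustrative baseline)."""
    rng = random.Random(seed)
    return {cell: rng.randrange(num_nodes) for cell in cells}


def nodes_touched(mapping, query_cells):
    """Number of distinct nodes a query covering query_cells must contact."""
    return len({mapping[c] for c in query_cells})


# Hypothetical setup: a 4 x 4 grid of partitions spread over 4 nodes.
cells = [(x, y) for x in range(4) for y in range(4)]
by_curve = curve_mapping(cells, num_nodes=4, bits=2)
by_chance = random_mapping(cells, num_nodes=4)

# A region query covering the lower-left 2 x 2 quadrant.
query = [(0, 0), (1, 0), (0, 1), (1, 1)]
print("curve mapping: ", nodes_touched(by_curve, query), "node(s)")
print("random mapping:", nodes_touched(by_chance, query), "node(s)")
```

In this toy setup the curve mapping places the whole quadrant on a single node, while the random mapping typically spreads it over several. Both mappings balance the number of cells per node only roughly, and neither accounts for data skew or query hot spots.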

Similar resources

Policy Driven Negotiation to Improve the QoS in Data Grid

Data grids have become an interesting and popular domain in the grid community (Foster and Kesselman, 2004). Generally, grids are proposed as solutions for large-scale systems, where data replication is a well-known technique used to reduce access latency and bandwidth consumption and to increase availability. In spite of the advantages of replication, there are many problems that should be solved such as,

E2DR: Energy Efficient Data Replication in Data Grid

Data grids are an important branch of grid computing that provides mechanisms for the management of large volumes of distributed data. Energy efficiency has recently emerged as a hot topic in large distributed systems. The development of computing systems is traditionally focused on performance improvements driven by the demand of client's applications in scientific and business domai...

User requirements from the academic science community drove the implementation of data management technology, and the evolution of data grid capabilities from simple data management to information and knowledge management. The SRB and iRODS systems

The Data Intensive Computing Environments group has been developing data grid technology for twenty years. Two generations of technology were created, the Storage Resource Broker SRB (1994-2006) and the integrated Rule Oriented Data System iRODS (2004-2016). Both products represented pioneering technology in distributed data management and were widely applied by communities interested in not on...

Scalable community-driven data sharing in e-science grids

E-science projects of various disciplines face a fundamental challenge: thousands of users want to obtain new scientific results by application-specific and dynamic correlation of data from globally distributed sources. Considering the enormous and exponentially growing data volumes involved, centralized data management reaches its limits. Since scientific data are often highly skewed and explor...

Quality-oriented and Metadata-driven Integration in Information Grids

The goal of information grids is to provide a virtually integrated view on information, which is physically stored in many distributed nodes of the grid. A user should be able to query the grid through a uniform query interface using a common data model, without knowing the details of the distribution of the data. Information grids that integrate information from heterogeneous resources have to...

Publication date: 2008